
NO-JIRA: DownStream Merge [02-27-2026]#3011

Merged
openshift-merge-bot[bot] merged 91 commits into openshift:master from jluhrsen:d/s-merge-02-27-2026 on Mar 22, 2026

Conversation

@jluhrsen
Contributor

📑 Description

Fixes #

Additional Information for reviewers

✅ Checks

  • My code requires changes to the documentation
  • if so, I have updated the documentation as required
  • My code requires tests
  • if so, I have added and/or updated the tests as required
  • All the tests have passed in the CI

How to verify it

danwinship and others added 30 commits January 19, 2026 15:18
Signed-off-by: Dan Winship <danwinship@redhat.com>
Ignore whitespace differences.
Sort the output back into the "correct" order.

Signed-off-by: Dan Winship <danwinship@redhat.com>
Replace the custom HTTP server in StartMetricsServer with MetricServer.

Signed-off-by: Lei Huang <leih@nvidia.com>
A DPU firmware settings change can cause the same physical
port to be re-enumerated under a different PCI address after
a host reboot. Previously, Init() only handled missing device
IDs (legacy annotations). Now it also detects when the
annotated device ID is no longer present in the allocator and
falls back to matching by PfId and FuncId.

Signed-off-by: Yury Kulazhenkov <ykulazhenkov@nvidia.com>
Signed-off-by: Tim Rozet <trozet@nvidia.com>
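The fallback described above might look roughly like the following sketch. This is hypothetical: `Device`, `resolveDevice`, and the field names are illustrative stand-ins, not the real driver API.

```go
package main

import "fmt"

// Device is an illustrative stand-in for an allocator entry.
type Device struct {
	ID     string // PCI address; may change across reboots
	PfId   int
	FuncId int
}

// resolveDevice prefers the annotated device ID, then falls back to
// matching by the stable PfId/FuncId pair when the ID has vanished
// (e.g. the port was re-enumerated after a firmware settings change).
func resolveDevice(annotatedID string, pfID, funcID int, known []Device) (*Device, error) {
	for i := range known {
		if known[i].ID == annotatedID {
			return &known[i], nil
		}
	}
	for i := range known {
		if known[i].PfId == pfID && known[i].FuncId == funcID {
			return &known[i], nil
		}
	}
	return nil, fmt.Errorf("no device for id %q or pf %d func %d", annotatedID, pfID, funcID)
}

func main() {
	devs := []Device{{ID: "0000:04:00.0", PfId: 0, FuncId: 2}}
	d, err := resolveDevice("0000:03:00.0", 0, 2, devs) // stale annotated ID
	fmt.Println(d, err)
}
```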
`kind setup` started to fail with this error:
```
ERROR: failed to load image: command "docker exec --privileged -i ovn-worker
   ctr --namespace=k8s.io images import --all-platforms --digests
   --snapshotter=overlayfs -" failed with error: exit status 1

Command Output: ctr: content digest sha256:9c04829e9...: not found
```

Related kind issue is kubernetes-sigs/kind#3795.
This change uses the workaround mentioned in the kind issue.

Signed-off-by: Lei Huang <leih@nvidia.com>
fix kind load docker-image content digest not found
Signed-off-by: fangyuchen86 <fangyuchen86@gmail.com>
Signed-off-by: Patryk Diak <pdiak@redhat.com>
The node gateway logic was not taking into account dynamic UDN.
Therefore if a UDN was created with a service, but our node was not
active, then at start up during syncServices we would fail due to
GetActiveNetworkForNamespace failing. After 60 seconds of syncServices
failing, it would lead to the OVN-Kube node crashing.

This commit introduces a common helper function to network manager
api, ResolveActiveNetworkForNamespaceOnNode, which will allow legacy
controllers that are not per-UDN or default controller to find the
primary network serving a namespace for their node.

The node/gateway is updated to use this function during sync, which
allows us to ignore objects whose network is not on our node with
Dynamic UDN.

Additionally it does not fail syncServices when a network is not found.
During NAD controller start up, all networks will have been processed.
If by the time gateway starts up and the network is missing, that means
it is a new event which this node has never seen before. Therefore it is
safe to skip it during syncServices and allow initial add handling to
take care of it later.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
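A minimal sketch of the helper's contract as described in the commit above. The `Network` and `Manager` types and the exact signature are assumptions, not the real ovn-kubernetes API.

```go
package netmanager

// Network and Manager are illustrative stand-ins for the real
// network manager API.
type Network interface {
	Name() string
	ActiveOnNode(node string) bool
}

type Manager interface {
	// May legitimately return (nil, nil): namespace removed or filtered
	// out by Dynamic UDN.
	GetActiveNetworkForNamespace(namespace string) (Network, error)
}

// ResolveActiveNetworkForNamespaceOnNode lets legacy controllers that are
// not per-UDN find the primary network serving a namespace on their node.
func ResolveActiveNetworkForNamespaceOnNode(m Manager, namespace, node string) (Network, error) {
	netw, err := m.GetActiveNetworkForNamespace(namespace)
	if err != nil {
		return nil, err // real failure: propagate to the caller
	}
	if netw == nil || !netw.ActiveOnNode(node) {
		// Not active on this node (Dynamic UDN): callers skip the object
		// during sync instead of crashing after repeated failures.
		return nil, nil
	}
	return netw, nil
}
```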
Network Policy add was not taking into account dynamic UDN. This was not
a problem for the layer2/layer3 UDN controller side, because if the node
was inactive, then the controllers wouldn't exist. However, it was a
problem for the default network controller, because if the DNC could not
get the active network, it would error and retry to add the KNP over and
over again for other UDNs.

This fixes it by checking the nad controller cache instead, which will
always have the full info to determine if the KNP belongs to CDN.

Furthermore, the delete KNP path was incorrect. It would try to get the
active network which could be gone during deletion. This was unnecessary
as the deleteNetworkPolicy code will check to see if it actually
configured it in the first place, making it a noop to always call
delete.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
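A hedged sketch of the add/delete pattern the commit describes; `NADCache`, `Controller`, and the method names are hypothetical stand-ins.

```go
package policy

// NADCache stands in for the nad controller cache, which always has the
// full info to decide which network a namespace belongs to.
type NADCache interface {
	PrimaryNetworkName(namespace string) string
}

type Controller struct {
	netName  string
	nadCache NADCache
	applied  map[string]bool // policies this controller actually configured
}

func (c *Controller) AddPolicy(namespace, name string) {
	if c.nadCache.PrimaryNetworkName(namespace) != c.netName {
		return // belongs to another network; no endless retry loop
	}
	c.applied[namespace+"/"+name] = true
	// ... program OVN for the policy ...
}

func (c *Controller) DeletePolicy(namespace, name string) {
	// No active-network lookup: the network may already be gone. Delete
	// is a no-op unless we configured the policy in the first place.
	if !c.applied[namespace+"/"+name] {
		return
	}
	delete(c.applied, namespace+"/"+name)
	// ... remove OVN config ...
}
```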
Needed to be updated for the same reasons as network policy. Services
controller is per UDN, and with an inactive node this is not a problem
for UDN controllers as they will not exist. However, for DNC it would
continue failing to get active network here. Use the nad controller
cache and shortcut the checks for default network controller.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
Should always just return default network in that case.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
EF calls GetActiveNetworkForNamespace in an initialSync migration
function. This function moves from cluster port group to namespace pgs.
It is old and arguably could be removed entirely, but for now move it
to use the nad controller cache. Also, do not cause OVNK to exit if we
cannot get the network name; just skip that entity.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
Egress IP controller runs as part of DNC, is event driven, and retries
on failures. It is also not dynamic UDN aware. This commit aims to fix
this by:

 - Change EgressIP to check with nad controller for network presence
 - If network is not processed/invalid skip retrying in egress IP
   controller
 - Register NAD Reconciler for Egress IP, so that when network becomes
   active Egress IP handles reconciliation.
 - If dynamic UDN is enabled, filter out EgressIP operations for
   inactive nodes.

Overall this should be a quality of life improvement to EgressIP and
reduce unnecessary reconciliation with UDN. Future steps will be to break
Egress IP into its own level driven controller.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
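The node filtering in the last bullet of the commit above might look like this minimal sketch; `eIPController` here and its fields are assumed names, not the actual implementation.

```go
package egressip

// eIPController is an illustrative stand-in for the real controller.
type eIPController struct {
	dynamicUDN bool
	nadCache   interface{ NetworkActiveOnNode(network, node string) bool }
}

// shouldProcess reports whether an EgressIP op for (network, node) should
// be handled now; false means skip without queuing a retry, since the NAD
// reconciler will redeliver the event once the network becomes active.
func (c *eIPController) shouldProcess(network, node string) bool {
	if !c.dynamicUDN {
		return true
	}
	return c.nadCache.NetworkActiveOnNode(network, node)
}
```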
Adds a test that creates a primary + secondary UDN, pod, egress IP, KNP,
MNP objects in those UDNs. Then restarts every ovnkube-pod, and ensures
it comes back up in ready state. This is useful in general to make sure
we survive restarts correctly, but especially useful for Dynamic UDN
where a network may not be active on a node and we want to ensure start
up syncing is not failing because of that.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
When a pod is recreated with the same name, the egressIP cache could already
contain a “served” {EgressIP,Node} status and skip programming as a no-op.
Since statusMap keys do not include pod IP, LRP/NAT state could remain stale
and traffic would miss egressIP SNAT.

Fix by detecting pod IP drift from podAssignment.podIPs and forcing a
delete+add reprogram for already-applied statuses:
 - compare cached pod IPs to current pod IPs
 - queue existing statuses for reprogram on IP change
 - delete old assignment state (without standby promotion) and re-add it
 - then update cached pod IPs

Signed-off-by: Tim Rozet <trozet@nvidia.com>
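A sketch of the IP-drift detection under assumed names (`status`, `podAssignment`, and the stubbed `deleteStatus`/`addStatus` are illustrative, not the real code).

```go
package egressip

import "reflect"

type status struct{ EgressIP, Node string }

type podAssignment struct {
	podIPs  []string
	applied []status
}

type controller struct {
	assignments map[string]*podAssignment
}

// Stubs for the real LRP/NAT programming paths.
func (c *controller) deleteStatus(key string, st status) error            { return nil }
func (c *controller) addStatus(key string, st status, ips []string) error { return nil }

// reconcilePod forces delete+add when the pod's IPs changed while the
// cached statuses were already applied (e.g. pod recreated with same name).
func (c *controller) reconcilePod(key string, currentIPs []string) error {
	pa := c.assignments[key]
	if pa == nil || reflect.DeepEqual(pa.podIPs, currentIPs) {
		return nil // nothing cached, or no IP drift
	}
	for _, st := range pa.applied {
		if err := c.deleteStatus(key, st); err != nil { // no standby promotion
			return err
		}
		if err := c.addStatus(key, st, currentIPs); err != nil {
			return err
		}
	}
	pa.podIPs = currentIPs // refresh the cache only after reprogramming
	return nil
}
```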
EgressIP pod handling assumes pod networking setup has already populated
logicalPortCache before egressIP reconciliation runs. That ordering holds
within one controller queue, but breaks for primary UDNs where pod setup
runs in UDN controllers while egressIP pod reconcile runs in the default
controller.

In that cross-controller race, egressIP reconcile can run first, fail to get
pod IPs (stale/missing LSP), and wait for normal retry cadence even after UDN
later updates port cache.

Fix by wiring an immediate egressIP pod retry on logicalPortCache add:
- add a base controller callback hook for logicalPortCache add events
- invoke it from default/UDN pod logical port add paths
- hook it for primary UDN controllers to enqueue no-backoff egressIP pod retry
- centralize retry logic in eIPController.addEgressIPPodRetry()
  (including PodNeedsSNAT filtering)

This preserves existing behavior while removing the UDN/DNC ordering race
window for egressIP pod programming.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
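The callback wiring could look roughly like this; `Cache`, `SetAddCallback`, and the hook shape are assumptions based on the commit text, not the actual logicalPortCache API.

```go
package logicalport

import "sync"

// Cache is an illustrative stand-in for ovn-kubernetes' logicalPortCache.
type Cache struct {
	mu    sync.Mutex
	ports map[string]string // pod key -> logical switch port name
	onAdd func(podKey string)
}

func NewCache() *Cache { return &Cache{ports: map[string]string{}} }

// SetAddCallback registers a hook fired after every port add, letting
// another controller retry immediately instead of waiting for its normal
// retry cadence.
func (c *Cache) SetAddCallback(fn func(podKey string)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.onAdd = fn
}

func (c *Cache) Add(podKey, lsp string) {
	c.mu.Lock()
	c.ports[podKey] = lsp
	fn := c.onAdd
	c.mu.Unlock()
	if fn != nil {
		// e.g. the primary-UDN hook enqueues a no-backoff egressIP pod
		// retry (addEgressIPPodRetry in the commit above).
		fn(podKey)
	}
}
```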
Removes UnprocessedActiveNetworkError and moves to using a single
error, InvalidPrimaryNetworkError, for everything. Modifies
GetActiveNetworkForNamespace to return nil when there is no active
network due to namespace being removed, or Dynamic UDN filtering.
Callers can then rely on this function to determine whether or not a
network is active versus the network should exist but doesn't (an
error).

Walked through all callers of GetActiveNetworkForNamespace and
GetPrimaryNADForNamespace and tried to simplify the number of calls
and the logic.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
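The resulting caller contract, sketched with hypothetical types:

```go
package netmanager

type Network interface{ Name() string }

type Manager interface {
	GetActiveNetworkForNamespace(namespace string) (Network, error)
}

// handleObject shows the contract after this change: nil network with nil
// error means "no active network" and is safe to skip; a non-nil error
// means the network should exist but could not be resolved, so retry.
func handleObject(m Manager, namespace string) error {
	netw, err := m.GetActiveNetworkForNamespace(namespace)
	if err != nil {
		return err // e.g. InvalidPrimaryNetworkError: worth retrying
	}
	if netw == nil {
		return nil // namespace removed or filtered by Dynamic UDN: skip
	}
	// ... process the object against netw ...
	return nil
}
```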
- Removes a second call to GetActiveNetworkForNamespace during egress
  firewall add. We can just use the cache object that already exists.

- Restructure the cache object to be a slice of subnets, rather than a
  string key.

- Fix util function CopyIPNets, which was not doing a deep copy of
  the underlying IP/Mask slices.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
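A sketch of what a correct deep copy of `[]*net.IPNet` looks like, matching the CopyIPNets fix described above (the real function's signature may differ):

```go
package util

import "net"

// CopyIPNets deep-copies the slice, including the underlying IP and Mask
// byte slices; copying only the *net.IPNet pointers would let callers
// that mutate the copy corrupt the original.
func CopyIPNets(ipnets []*net.IPNet) []*net.IPNet {
	out := make([]*net.IPNet, 0, len(ipnets))
	for _, n := range ipnets {
		ip := make(net.IP, len(n.IP))
		copy(ip, n.IP)
		mask := make(net.IPMask, len(n.Mask))
		copy(mask, n.Mask)
		out = append(out, &net.IPNet{IP: ip, Mask: mask})
	}
	return out
}
```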
Code was modifying the annotations of the informer cache node object. If
this was happening while another goroutine was reading the annotation
map, it would trigger ovnkube to crash!

Fixes: #5950

Signed-off-by: Tim Rozet <trozet@nvidia.com>
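The standard fix for this class of crash is to deep-copy informer objects before mutating them; a small sketch, with `withUpdatedAnnotation` as a hypothetical helper:

```go
package node

import corev1 "k8s.io/api/core/v1"

// withUpdatedAnnotation returns a mutated copy of the node. Objects handed
// out by informer listers are shared with every other consumer, so they
// must be deep-copied before any write, including annotation updates.
func withUpdatedAnnotation(cached *corev1.Node, key, value string) *corev1.Node {
	n := cached.DeepCopy()
	if n.Annotations == nil {
		n.Annotations = map[string]string{}
	}
	n.Annotations[key] = value
	return n
}
```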
Signed-off-by: Tim Rozet <trozet@nvidia.com>
Gateway egress IP adds IPs to an annotation on the node. The code was
assuming the informer object should have the latest data, then
overwriting the IPs using that information. That isn't reliable as the
informer could have stale data compared to recent kubeclient updates.
This would trigger egress IP logic to corrupt the IPs in the node
annotation, and cause further drift/corruption in subsequent updates.

This fixes it by creating a local cache of IPs for the controller, and
using that as the source of truth, initialized on start up from the node
object. Then updates are driven by what is in the cache, versus what is
in the informer.

Also fixes places where tests should have been using Eventually.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
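A hypothetical sketch of the local-cache pattern: seed once from the node object at startup, then drive annotation updates from the cache rather than the possibly stale informer copy.

```go
package gatewayeip

import "sync"

// ipCache is the controller's source of truth for gateway egress IPs.
type ipCache struct {
	mu  sync.Mutex
	ips map[string]struct{}
}

// newIPCache is initialized from the node object on startup.
func newIPCache(initial []string) *ipCache {
	c := &ipCache{ips: map[string]struct{}{}}
	for _, ip := range initial {
		c.ips[ip] = struct{}{}
	}
	return c
}

// snapshot returns the IPs to write to the node annotation; updates are
// driven by the cache rather than by re-reading the informer object.
func (c *ipCache) snapshot() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make([]string, 0, len(c.ips))
	for ip := range c.ips {
		out = append(out, ip)
	}
	return out
}

func (c *ipCache) add(ip string)    { c.mu.Lock(); c.ips[ip] = struct{}{}; c.mu.Unlock() }
func (c *ipCache) remove(ip string) { c.mu.Lock(); delete(c.ips, ip); c.mu.Unlock() }
```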
Signed-off-by: fangyuchen86 <fangyuchen86@gmail.com>
EgressIP: Fix crash from mutating node informer object
Fixes missing Dynamic UDN integration and incorrect logic with GetActiveNetworkForNamespace; adds EgressIP NAD Reconciler
In Egress IP tracker when GetPrimaryNADForNamespace returns an
InvalidPrimaryNetworkError we return nil during the sync, as we expect
the NAD controller to deliver the event later when the NAD is processed.

However, in this UT there is no full NAD controller and it relies on the
lister. Therefore the UT may run before the informer cache is populated
and never get notified from the "NAD Controller". To fix it, wait until
the informer cache is populated and then simulate the NAD Controller
behavior by Reconciling the NAD key.

Fixes: #5953

Signed-off-by: Tim Rozet <trozet@nvidia.com>
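Sketched as a standalone test under assumed names (`fakeLister` stands in for the informer-backed lister; the reconcile hook at the end is hypothetical):

```go
package egressip_test

import (
	"errors"
	"sync/atomic"
	"testing"
	"time"

	"github.com/onsi/gomega"
)

// fakeLister stands in for the informer-backed NAD lister in the real UT.
type fakeLister struct{ ready atomic.Bool }

func (f *fakeLister) Get(key string) error {
	if !f.ready.Load() {
		return errors.New("informer cache not populated yet")
	}
	return nil
}

func TestWaitForCacheThenReconcile(t *testing.T) {
	g := gomega.NewWithT(t)
	l := &fakeLister{}
	go func() { time.Sleep(10 * time.Millisecond); l.ready.Store(true) }()

	// Wait for the lister to see the NAD instead of racing the informer.
	g.Eventually(func() error { return l.Get("ns1/primary-nad") }).Should(gomega.Succeed())

	// Then simulate the NAD controller delivering the event by reconciling
	// the NAD key directly (hypothetical hook in the real test):
	// eipController.ReconcileNetAttachDef("ns1/primary-nad")
}
```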
Skip namespaces with deletionTimestamp set when selecting target
namespaces, which triggers NAD deletion for terminating namespaces.

Signed-off-by: Patryk Diak <pdiak@redhat.com>
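A tiny sketch of the selection filter, with `selectTargets` as a hypothetical helper:

```go
package namespaces

import corev1 "k8s.io/api/core/v1"

// selectTargets excludes namespaces being deleted; dropping them from the
// target set is what triggers NAD deletion for terminating namespaces.
func selectTargets(all []*corev1.Namespace) []*corev1.Namespace {
	var out []*corev1.Namespace
	for _, ns := range all {
		if ns.DeletionTimestamp != nil {
			continue // terminating: leave it out so its NAD gets removed
		}
		out = append(out, ns)
	}
	return out
}
```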
@tssurya
Contributor

tssurya commented Mar 12, 2026

@jluhrsen please add the verified label yourself.. seems like mine isn't taking effect...

openshift-ci-robot added the verified label (Signifies that the PR passed pre-merge verification criteria) Mar 12, 2026
@openshift-ci-robot
Contributor

@tssurya: This PR has been marked as verified by ci.


In response to this:

> /verified by ci


@jluhrsen
Contributor Author

> @jluhrsen please add the verified label yourself.. seems like mine isn't taking effect...

It finally worked. And now all of CI is running again, so we will have more retests and churn to deal with.

@arkadeepsen
Member

/retest-required

@arkadeepsen
Member

Re-running e2e-metal-ipi-ovn-dualstack-bgp since other tests apart from [sig-network-edge] DNS should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] are failing.

/test e2e-metal-ipi-ovn-dualstack-bgp

@arkadeepsen
Member

/test e2e-metal-ipi-ovn-dualstack-bgp

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD eedfcd0 and 2 for PR HEAD eafe465 in total

@arkadeepsen
Member

@asood-rh @anuragthehatter e2e-aws-ovn-fdp-qe is consistently failing on the following 4 tests. Can you please take a look?

SDN: OCP-12926:SDN pods should be able to subscribe send and receive multicast traffic
SDN: OCP-12930:SDN Same multicast groups can be created in multiple namespace
SDN: OCP-12928:SDN pods should be able to join multiple multicast groups at same time
SDN: OCP-12929:SDN pods should not be able to receive multicast traffic from other pods in different namespace

@arkadeepsen
Member

Along with

[sig-network-edge] DNS should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]

e2e-metal-ipi-ovn-dualstack-bgp is consistently failing on the following 4 tests due to the packet-sniffer daemonset pod not being able to become ready on one node. Giving it one more retry.

[sig-network][OCPFeatureGate:RouteAdvertisements][Feature:RouteAdvertisements][apigroup:operator.openshift.io] when using openshift ovn-kubernetes [EgressIP] Advertising EgressIP [apigroup:user.openshift.io][apigroup:security.openshift.io] For cluster user defined networks When the network topology is Layer 3 UDN pods should have the assigned EgressIPs and EgressIPs can be created, updated and deleted [apigroup:route.openshift.io] When the network is IPv6 [Suite:openshift/conformance/parallel] 
[sig-network][OCPFeatureGate:RouteAdvertisements][Feature:RouteAdvertisements][apigroup:operator.openshift.io] when using openshift ovn-kubernetes [PodNetwork] Advertising a cluster user defined network [apigroup:user.openshift.io][apigroup:security.openshift.io] Over the default VRF When the network topology is Layer 2 Pods should communicate with external host without being SNATed [Suite:openshift/conformance/parallel] 
[sig-network][OCPFeatureGate:RouteAdvertisements][Feature:RouteAdvertisements][apigroup:operator.openshift.io] when using openshift ovn-kubernetes [EgressIP] Advertising EgressIP [apigroup:user.openshift.io][apigroup:security.openshift.io] For the default network Pods should have the assigned EgressIPs and EgressIPs can be created, updated and deleted [apigroup:route.openshift.io] When the network is IPv4 [Suite:openshift/conformance/parallel] 
[sig-network][OCPFeatureGate:RouteAdvertisements][Feature:RouteAdvertisements][apigroup:operator.openshift.io] when using openshift ovn-kubernetes [PodNetwork] Advertising a cluster user defined network [apigroup:user.openshift.io][apigroup:security.openshift.io] Over the default VRF When the network topology is Layer 3 External host should be able to query route advertised pods by the pod IP [Suite:openshift/conformance/parallel]

/test e2e-metal-ipi-ovn-dualstack-bgp

@asood-rh
Contributor

asood-rh commented Mar 16, 2026

> @asood-rh @anuragthehatter e2e-aws-ovn-fdp-qe is consistently failing on the following 4 tests. Can you please take a look?
>
> SDN: OCP-12926:SDN pods should be able to subscribe send and receive multicast traffic
> SDN: OCP-12930:SDN Same multicast groups can be created in multiple namespace
> SDN: OCP-12928:SDN pods should be able to join multiple multicast groups at same time
> SDN: OCP-12929:SDN pods should not be able to receive multicast traffic from other pods in different namespace

It is failing consistently in recent runs.

@arkadeepsen It did pass one time

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/3011/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-fdp-qe/2027252982126481408/artifacts/e2e-aws-ovn-fdp-qe/cucushift-e2e/build-log.txt

```
/logs/artifacts/junit:
total 32
/logs/artifacts/serial/junit-report:
total 0
Done. 4 files processed.
cucushift-e2e:
  total: 26
  failures: 0
  errors: 0
  skipped: 2
```

@arkadeepsen
Member

> > @asood-rh @anuragthehatter e2e-aws-ovn-fdp-qe is consistently failing on the following 4 tests. Can you please take a look?
> >
> > SDN: OCP-12926:SDN pods should be able to subscribe send and receive multicast traffic
> > SDN: OCP-12930:SDN Same multicast groups can be created in multiple namespace
> > SDN: OCP-12928:SDN pods should be able to join multiple multicast groups at same time
> > SDN: OCP-12929:SDN pods should not be able to receive multicast traffic from other pods in different namespace
>
> It is failing consistently in recent runs.
>
> @arkadeepsen It did pass one time
>
> https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/3011/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-fdp-qe/2027252982126481408/artifacts/e2e-aws-ovn-fdp-qe/cucushift-e2e/build-log.txt
>
> /logs/artifacts/junit:
> total 32
> /logs/artifacts/serial/junit-report:
> total 0
> Done. 4 files processed.
> cucushift-e2e:
> total: 26
> failures: 0
> errors: 0
> skipped: 2

Yes. However, since March 10 it has been consistently failing: https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-fdp-qe

Some of the failures in the above link are due to rhcos image issues. But all the latest runs are failing on the above-mentioned tests.

@asood-rh
Contributor

> > > @asood-rh @anuragthehatter e2e-aws-ovn-fdp-qe is consistently failing on the following 4 tests. Can you please take a look?
> > >
> > > SDN: OCP-12926:SDN pods should be able to subscribe send and receive multicast traffic
> > > SDN: OCP-12930:SDN Same multicast groups can be created in multiple namespace
> > > SDN: OCP-12928:SDN pods should be able to join multiple multicast groups at same time
> > > SDN: OCP-12929:SDN pods should not be able to receive multicast traffic from other pods in different namespace
> >
> > It is failing consistently in recent runs.
> > @arkadeepsen It did pass one time
> > https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/3011/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-fdp-qe/2027252982126481408/artifacts/e2e-aws-ovn-fdp-qe/cucushift-e2e/build-log.txt
> > /logs/artifacts/junit:
> > total 32
> > /logs/artifacts/serial/junit-report:
> > total 0
> > Done. 4 files processed.
> > cucushift-e2e:
> > total: 26
> > failures: 0
> > errors: 0
> > skipped: 2
>
> Yes. However, since March 10 it has been consistently failing: https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-fdp-qe
>
> Some of the failures in the above link are due to rhcos image issues. But all the latest runs are failing on the above-mentioned tests.

@yingwang-0320 Could you please look at the multicast test failures?

@openshift-ci
Contributor

openshift-ci Bot commented Mar 16, 2026

@jluhrsen: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/security | eafe465 | link | false | /test security |

Full PR test history. Your PR dashboard.



@jluhrsen
Contributor Author

> Re-running e2e-metal-ipi-ovn-dualstack-bgp since other tests apart from [sig-network-edge] DNS should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] are failing.
>
> /test e2e-metal-ipi-ovn-dualstack-bgp

bug
slack thread

@pperiyasamy
Member

@tssurya are we good to merge this one now?

@pperiyasamy
Member

It seems we need to override the e2e-aws-ovn-fdp-qe job, which is perma-failing.

@arkadeepsen
Member

/test e2e-aws-ovn-fdp-qe

@openshift-ci
Contributor

openshift-ci Bot commented Mar 22, 2026

@tssurya: Overrode contexts on behalf of tssurya: ci/prow/e2e-metal-ipi-ovn-dualstack, ci/prow/e2e-metal-ipi-ovn-dualstack-bgp, ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw


In response to this:

> /override ci/prow/e2e-metal-ipi-ovn-dualstack
> /override ci/prow/e2e-metal-ipi-ovn-dualstack-bgp
> /override ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw
>
> https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/3011/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack-bgp/2033526548698501120
> failing due to https://redhat.atlassian.net/browse/OCPBUGS-78053 tracker
>
> https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/3011/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack/2032494497375457280 same bug
>
> https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/3011/pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack-bgp-local-gw/2032494499896233984 is same bug


@tssurya
Contributor

tssurya commented Mar 22, 2026

/tide refresh

@tssurya
Contributor

tssurya commented Mar 22, 2026

/shrug

openshift-ci Bot added the ¯\_(ツ)_/¯ label Mar 22, 2026
openshift-merge-bot Bot merged commit 0516832 into openshift:master Mar 22, 2026
32 of 33 checks passed
jluhrsen added a commit to jluhrsen/ovn-kubernetes-1 that referenced this pull request Mar 23, 2026
…026)

This is a manual sync from release-4.22 to release-4.21, excluding PR openshift#3011
which merged today and should not be included in this sync.

Merging up to commit eedfcd0 (Merge pull request openshift#2978)
Excluding commit 96f8acd (Merge pull request openshift#3011 from jcaamano/bz2089392)
jluhrsen added a commit to jluhrsen/ovn-kubernetes-1 that referenced this pull request Mar 23, 2026
…026)

This is a manual sync from release-4.22 to release-4.21, excluding PR openshift#3011
which merged on March 22, 2026 and should not be included in this sync.

Merging up to commit eedfcd0 (Merge pull request openshift#2978)
Excluding PR openshift#3011 (91 commits from downstream merge 02-27-2026)

Merge conflicts resolved:
- go-controller/pkg/ovn/base_network_controller_pods.go
  * Variable rename nadName -> nadKey
  * Removed duplicate isNonHostSubnetSwitch method declaration

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
verified: Signifies that the PR passed pre-merge verification criteria.
¯\_(ツ)_/¯
